In this report, we first work with different methods for expressing datasets described by multidimensional feature vectors as two-dimensional vectors with as little loss of information as possible. Reducing the dataset to two dimensions is a great convenience, especially for visualization, since we can observe the cluster structures at a high level and gain insight into the data.
Next, we apply different clustering methods with different hyperparameters, such as the number of clusters and the linkage criterion, to our dataset. We compare how consistently our interpretations of the clusters seen in the visualizations produced by the different projection methods match the actual clusters formed by the clustering methods.
Finally, using clustering evaluation and validation methods, we assess whether the cluster structures created in the projection and clustering sections are the right choices for clustering our dataset, and we present the final version of the cluster structure, with a better clustering algorithm and hyperparameters where applicable. We use external, internal, and relative criteria for this validity assessment.
For the final report, we chose the dataset Pokemon with stats from Kaggle (https://www.kaggle.com/abcsds/pokemon).
The dataset stores 800 Pokemon, including their name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. These statistics are derived from the values used in the Pokemon games. The dataset consists of 13 columns: 9 numerical (integer), 3 string, and 1 boolean.
setwd("C:/Users/USER/Desktop/R Tutorials/Dataset Selection")
df = read.csv("Pokemon.csv")
head(df)
## X. Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## Sp..Def Speed Generation Legendary
## 1 65 45 1 False
## 2 80 60 1 False
## 3 100 80 1 False
## 4 120 80 1 False
## 5 50 65 1 False
## 6 65 80 1 False
Let’s check for missing values.
sum(is.na(df))
## [1] 0
As you can see, there are no missing values.
Now, let’s look at the summary of the data.
summary(df)
## X. Name Type.1 Type.2
## Min. : 1.0 Length:800 Length:800 Length:800
## 1st Qu.:184.8 Class :character Class :character Class :character
## Median :364.5 Mode :character Mode :character Mode :character
## Mean :362.8
## 3rd Qu.:539.2
## Max. :721.0
## Total HP Attack Defense
## Min. :180.0 Min. : 1.00 Min. : 5 Min. : 5.00
## 1st Qu.:330.0 1st Qu.: 50.00 1st Qu.: 55 1st Qu.: 50.00
## Median :450.0 Median : 65.00 Median : 75 Median : 70.00
## Mean :435.1 Mean : 69.26 Mean : 79 Mean : 73.84
## 3rd Qu.:515.0 3rd Qu.: 80.00 3rd Qu.:100 3rd Qu.: 90.00
## Max. :780.0 Max. :255.00 Max. :190 Max. :230.00
## Sp..Atk Sp..Def Speed Generation
## Min. : 10.00 Min. : 20.0 Min. : 5.00 Min. :1.000
## 1st Qu.: 49.75 1st Qu.: 50.0 1st Qu.: 45.00 1st Qu.:2.000
## Median : 65.00 Median : 70.0 Median : 65.00 Median :3.000
## Mean : 72.82 Mean : 71.9 Mean : 68.28 Mean :3.324
## 3rd Qu.: 95.00 3rd Qu.: 90.0 3rd Qu.: 90.00 3rd Qu.:5.000
## Max. :194.00 Max. :230.0 Max. :180.00 Max. :6.000
## Legendary
## Length:800
## Class :character
## Mode :character
##
##
##
Although there are 800 instances in the dataset, there are only 721 distinct IDs, because Pokemon that are evolved or Mega forms of one another are stored under the same ID.
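This can be checked directly; X. is the name that read.csv() gave the ID column, as the head() output above shows.
length(unique(df$X.))
## [1] 721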
Finally, let’s observe the correlation between the variables.
numeric_df = df[sapply(df, is.numeric)]
corr_matrix = cor(numeric_df)
corr_matrix
## X. Total HP Attack Defense Sp..Atk
## X. 1.00000000 0.11981278 0.09761387 0.10229779 0.09478582 0.08875893
## Total 0.11981278 1.00000000 0.61874835 0.73621065 0.61278743 0.74724986
## HP 0.09761387 0.61874835 1.00000000 0.42238603 0.23962232 0.36237986
## Attack 0.10229779 0.73621065 0.42238603 1.00000000 0.43868706 0.39636176
## Defense 0.09478582 0.61278743 0.23962232 0.43868706 1.00000000 0.22354861
## Sp..Atk 0.08875893 0.74724986 0.36237986 0.39636176 0.22354861 1.00000000
## Sp..Def 0.08581650 0.71760947 0.37871807 0.26398955 0.51074659 0.50612142
## Speed 0.01073349 0.57594266 0.17595206 0.38123974 0.01522660 0.47301788
## Generation 0.98251567 0.04838402 0.05868251 0.05145134 0.04241857 0.03643683
## Sp..Def Speed Generation
## X. 0.08581650 0.01073349 0.98251567
## Total 0.71760947 0.57594266 0.04838402
## HP 0.37871807 0.17595206 0.05868251
## Attack 0.26398955 0.38123974 0.05145134
## Defense 0.51074659 0.01522660 0.04241857
## Sp..Atk 0.50612142 0.47301788 0.03643683
## Sp..Def 1.00000000 0.25913311 0.02848599
## Speed 0.25913311 1.00000000 -0.02312106
## Generation 0.02848599 -0.02312106 1.00000000
library(corrplot)
## corrplot 0.90 loaded
corrplot(corr_matrix)
Let's restrict the data to the numerical features, excluding the ID, to make our dataset suitable for the PCA and MDS methods and for clustering.
numDF = df[c(5,6,7,8,9,10,11,12)]
head(numDF,5)
## Total HP Attack Defense Sp..Atk Sp..Def Speed Generation
## 1 318 45 49 49 65 65 45 1
## 2 405 60 62 63 80 80 60 1
## 3 525 80 82 83 100 100 80 1
## 4 625 80 100 123 122 120 80 1
## 5 309 39 52 43 60 50 65 1
In this part, we used the video: https://www.youtube.com/watch?v=0Jp4gsfOLMs. [1]
We first apply PCA to the dataset with the prcomp() function. Then, we visualize it according to the two principal components with the highest variation.
pca = prcomp(numDF, scale=TRUE)
plot(pca$x[,1], pca$x[,2], xlab="PC1", ylab="PC2", main="PCA")
Let’s draw a Scree Plot to see how much variation each principal component contains.
pca.var = pca$sdev^2
pca.var.per = round(pca.var/sum(pca.var)*100,1)
barplot(pca.var.per, main="Scree Plot", xlab="PC", ylab="Percent Variation")
It’s important to use the summary() function to observe the Cumulative Proportion of Variance of principal components.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9269 1.0536 0.9934 0.88030 0.84869 0.65380 0.51715
## Proportion of Variance 0.4641 0.1388 0.1234 0.09686 0.09003 0.05343 0.03343
## Cumulative Proportion 0.4641 0.6029 0.7262 0.82310 0.91314 0.96657 1.00000
## PC8
## Standard deviation 1.726e-15
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
For the cumulative proportion to exceed 0.9, we must use 5 components.
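We can confirm this from the percentages computed earlier; which() returns the first component count at which the cumulative percentage reaches 90.
which(cumsum(pca.var.per) >= 90)[1]
## [1] 5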
Let’s create a more visually satisfying graph using the ggplot library.
library(ggplot2)
pca.data = data.frame(X=pca$x[,1], Y=pca$x[,2])
ggplot(data=pca.data, aes(x=X, y=Y, label="+")) + geom_text() + xlab(paste("PC1 - ", pca.var.per[1], "%", sep="")) + ylab(paste("PC2 - ", pca.var.per[2], "%", sep="")) + theme_bw() + ggtitle("PCA Graph")
Finally, since the principal components are linear combinations of the features, let's examine the weight with which each feature contributes to the two principal components used in the visualization.
loading_scores = pca$rotation[,1]
loading_scores = abs(loading_scores)
ld_ranked = sort(loading_scores, decreasing = TRUE)
pca$rotation[names(ld_ranked),1]
## Total Sp..Atk Sp..Def Attack HP Defense Speed
## 0.51853234 0.38980094 0.37971841 0.37801677 0.32945198 0.31314747 0.29027307
## Generation
## 0.03518934
loading_scores2 = pca$rotation[,2]
loading_scores2 = abs(loading_scores2)
ld_ranked2 = sort(loading_scores2, decreasing = TRUE)
pca$rotation[names(ld_ranked2),2]
## Speed Defense Generation Sp..Atk Sp..Def HP
## 0.639859006 -0.571057474 -0.362921024 0.283525992 -0.203158632 -0.104395366
## Total Attack
## 0.014201859 -0.001280817
In the following two parts, we used the video: https://www.youtube.com/watch?v=pGAUHhLYp5Q. [2]
First, we apply classical MDS, also known as Principal Coordinate Analysis (PCoA or PCO). We avoid the Euclidean distance metric because it would give the same result as PCA; instead we prefer the Manhattan distance. We used the cmdscale() function for this. Then we examined how much of the variation the first 5 components carry.
dist.matrix = dist(scale(numDF, center=TRUE, scale=TRUE), method="manhattan")
mds = cmdscale(dist.matrix, eig=TRUE, x.ret=TRUE)
mds.var.per = round(mds$eig/sum(mds$eig)*100,1)
mds.var.per[1:5]
## [1] 54.9 14.6 12.9 9.1 7.5
For the cumulative proportion to exceed 0.9, we must use 4 components.
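This can be confirmed from the eigenvalue percentages computed above:
which(cumsum(mds.var.per) >= 90)[1]
## [1] 4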
Here is the distribution of the data according to the MDS1 and MDS2 after the Principal Coordinate Analysis with Manhattan distance.
mds.values = mds$points
mds.data = data.frame(X=mds.values[,1], Y=mds.values[,2])
ggplot(data=mds.data, aes(x=X,y=Y, label="+")) + geom_text() + theme_bw() + xlab(paste("MDS1 - ", mds.var.per[1], "%", sep="")) + ylab(paste("MDS2 - ", mds.var.per[2], "%", sep="")) + ggtitle("MDS Plot Using Manhattan Distance")
We saw this method in a video from the YouTube channel StatQuest with Josh Starmer (https://www.youtube.com/watch?v=pGAUHhLYp5Q). With this method there was a large increase in the variation captured by the first two components; since we needed that increase, we thought it appropriate to try it. But because the dist() function offers no such distance method, we had to compute these values manually: we created a matrix filled with 0 and then filled its lower triangle.
log2.numDF = log2(numDF)
log2.dist = matrix(0, nrow=nrow(log2.numDF), ncol=nrow(log2.numDF))
for(i in 1:nrow(log2.dist)) {
for(j in 1:i) {
log2.dist[i,j] <- rowMeans(abs(log2.numDF[i,] - log2.numDF[j,]))
}
}
We completed the analysis by applying the same procedures as the previous method.
pco = cmdscale(as.dist(log2.dist), eig=TRUE, x.ret=TRUE)
pco.var.per = round(pco$eig/sum(pco$eig)*100,1)
pco.var.per[1:10]
## [1] 50.6 20.8 12.8 9.0 6.5 5.6 5.2 3.4 2.8 2.5
For the cumulative proportion to exceed 0.9, we must use 4 components.
pco.values = pco$points
pco.data = data.frame(X=pco.values[,1], Y=pco.values[,2])
ggplot(data=pco.data, aes(x=X,y=Y, label="+")) + geom_text() + theme_bw() + xlab(paste("MDS1 - ", pco.var.per[1], "%", sep="")) + ylab(paste("MDS2 - ", pco.var.per[2], "%", sep="")) + ggtitle("MDS Plot Using avg(logFC) as the Distance")
We couldn't find as many resources on implementing this method as on the others, so our analysis was a bit more limited. We used the R documentation while writing this code. [3]
However, we ran into a problem when applying this function: instances at zero or negative distance from each other caused an error that prevented it from running. Since we used the Euclidean distance, a negative distance between two samples was impossible, so we suspected that these distances were zero. When we examined the samples reported to be at zero distance in the original numDF data frame, we saw that, once the non-numeric columns are dropped, they are duplicates. We therefore removed the duplicate rows before applying this method.
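Before removing them, we can count how many rows of numDF duplicate an earlier row across all eight numeric features:
sum(duplicated(numDF))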
library(MASS)
notDup = numDF[!duplicated(numDF),]
dist2 = dist(scale(notDup, center=TRUE, scale=TRUE), method="euclidean")
iso = isoMDS(dist2, y=cmdscale(dist2,2), k=2, maxit=20, p=2)
## initial value 24.979582
## iter 5 value 20.034201
## iter 10 value 18.062551
## final value 17.807568
## converged
Even though we set the maximum number of iterations to 20, isoMDS() converged to a stress of 17.81 after about 10 iterations. This is quite a large value; a much smaller stress would be better for an MDS of this dataset. However, the analyses above already showed that the dataset does not perform very well under multidimensional scaling.
iso.values = iso$points
iso.data = data.frame(X=iso.values[,1], Y=iso.values[,2])
ggplot(data=iso.data, aes(x=X,y=Y, label="+")) + geom_text() + theme_bw() + xlab(paste("ISO1")) + ylab(paste("ISO2")) + ggtitle("Non-metric MDS")
For the clustering part, we first apply hierarchical clustering to the dataset, trying different linkage methods. Let's start with single-link.
For this stage, we first scaled our dataset. We created a distance matrix using the Euclidean distance and hierarchically clustered the dataset with this matrix and the hclust() function. You can see the resulting dendrogram below.
scaledNumDF = scale(numDF)
distances = dist(scaledNumDF, method="euclidean")
hierClustMin = hclust(distances, method="single")
plot(hierClustMin, main="Hierarchical Clustering with Single Link")
As stated in the course materials, single-link hierarchical clustering is very sensitive to outliers and noise. As the graphs in the PCA and MDS parts show, although the samples in the dataset form clustered structures, many of them also fall well outside these clusters, which indicates that our dataset contains outliers. Since the dendrogram is too cramped and complex to read with the naked eye, we do not focus on this method any further and move on to the next linkage method.
Next, we implemented hierarchical clustering with complete-linkage.
hierClustMax = hclust(distances, method="complete")
plot(hierClustMax, main="Hierarchical Clustering with Complete Link")
In the graphs from the PCA and MDS sections, we observed 3 to 5 clusters in the dataset. Looking at this dendrogram, choosing 5 as the number of clusters seems appropriate. One of these clusters consists solely of the sample at index 231, so we can say that this sample forms a stand-alone outlier cluster.
Let’s see how our dataset looks when we divide it into 5 clusters. We used the factoextra library for this.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
hierarchicalClusters = cutree(hierClustMax, k=5)
fviz_cluster(list(data=numDF, cluster=hierarchicalClusters))
The course materials state that complete-link hierarchical clustering is biased towards globular clusters and tends to break large clusters apart, and we can see this situation in the figure: the clusters overlap very heavily. This may partly reflect the information lost when a multidimensional feature vector is reduced to two dimensions, but the visual result is still unsatisfactory.
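As a quick sanity check on the outlier cluster noted earlier, table() shows how many samples fall into each of the 5 complete-link clusters:
table(hierarchicalClusters)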
Afterwards, we implemented hierarchical clustering with group average.
hierClustAvg = hclust(distances, method="average")
plot(hierClustAvg, main="Hierarchical Clustering with Group Average")
Just as with single-link, the dendrogram of the group-average method is not very suitable for choosing the number of clusters, so we skip this method as well.
Finally, we implemented hierarchical clustering with Ward’s method.
hierClustWard = hclust(distances, method="ward.D2")
plot(hierClustWard, main="Hierarchical Clustering with Ward's Method")
The dendrogram of Ward's method gives a much clearer result than the previous three dendrograms, so we focus on it rather than the others. As mentioned before, the graphs in the PCA and MDS sections suggested choosing a cluster number between 3 and 5. Let's therefore examine the values 3 and 5 with this clustering method.
plot(hierClustWard, main="Hierarchical Clustering with Ward's Method")
rect.hclust(hierClustWard, k=3, border = 2:4)
hierarchicalClusters2 = cutree(hierClustWard, k=3)
tapply(numDF$Total, hierarchicalClusters2, mean)
## 1 2 3
## 311.1800 484.1952 632.7882
Since the feature Total is the sum of the numeric values of all the other features except Generation, we think it summarizes each sample in a compact way. We therefore looked at the mean of the Total feature in each cluster. On this basis, the samples in the dataset divide into clusters of "weak", "moderate", and "strong" Pokemon according to their in-game stats.
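The claim that Total is the sum of the six stats can be verified directly; the check below returns TRUE only if it holds for every row.
all(numDF$Total == numDF$HP + numDF$Attack + numDF$Defense + numDF$Sp..Atk + numDF$Sp..Def + numDF$Speed)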
Let’s visualize these clusters.
fviz_cluster(list(data=numDF, cluster=hierarchicalClusters2))
There is still a large amount of overlap in the clusters, but it is a more acceptable result than the previous clusters.
Let’s repeat the same process with 5 clusters.
plot(hierClustWard, main="Hierarchical Clustering with Ward's Method")
rect.hclust(hierClustWard, k=5, border = 2:6)
hierarchicalClusters3 = cutree(hierClustWard, k=5)
fviz_cluster(list(data=numDF, cluster=hierarchicalClusters3))
Although this two-dimensional representation shows only 60.3% of the variation in the dataset, such heavily overlapped clustering is unfortunately not useful for forming an idea about the data.
Before moving on to k-means clustering, let's draw an elbow plot to choose an appropriate k value and see the WSS at each number of clusters.
fviz_nbclust(scaledNumDF, kmeans, method="wss") + labs(subtitle = "Elbow Plot")
As can be seen from the plot, the elbow occurs at k=2. For this reason, we will start trying cluster numbers from 2 and continue up to 6.
set.seed(2021)
kmeansCluster2 = kmeans(numDF, 2, nstart = 5)
fviz_cluster(list(data=numDF, cluster=kmeansCluster2$cluster))
Although the clustering does a pretty good job, we think that 2 clusters is too few classes for Pokemon segmentation.
Let’s change the initialization and see how it will affect clustering.
kmeansCluster2_2 = kmeans(numDF, 2, nstart = 25)
fviz_cluster(list(data=numDF, cluster=kmeansCluster2_2$cluster))
When we changed the initialization, we saw only very small changes on the boundary where the clusters touch each other.
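Since k-means cluster labels are arbitrary, the two runs are easiest to compare with a cross-tabulation; points off the main (or anti-) diagonal are the ones that changed clusters between initializations.
table(nstart5 = kmeansCluster2$cluster, nstart25 = kmeansCluster2_2$cluster)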
Let's change the cluster number to 3 and see what changes.
kmeansCluster3 = kmeans(numDF, 3, nstart = 5)
fviz_cluster(list(data=numDF, cluster=kmeansCluster3$cluster))
We can say that the results are much easier to interpret than the hierarchical clusters.
kmeansCluster3_2 = kmeans(numDF, 3, nstart = 35)
fviz_cluster(list(data=numDF, cluster=kmeansCluster3_2$cluster))
We changed the initialization once again and saw almost no difference. Therefore, we stop varying the initialization and examine only the effect of the number of clusters on the clustering.
Let’s make cluster number 4.
kmeansCluster4 = kmeans(numDF, 4, nstart = 5)
fviz_cluster(list(data=numDF, cluster=kmeansCluster4$cluster))
Let’s try k=5.
kmeansCluster5 = kmeans(numDF, 5, nstart = 25)
fviz_cluster(list(data=numDF, cluster=kmeansCluster5$cluster))
Lastly, we want to try k=6, since we think that the rightmost cluster in the figure should also be divided into two clusters.
kmeansCluster6 = kmeans(numDF, 6, nstart = 25)
fviz_cluster(list(data=numDF, cluster=kmeansCluster6$cluster))
k-means clustering did a pretty good job in the visualizations. Although the elbow method suggests k=2 as the most appropriate value, we can work with more clusters.
In the previous part, we concluded that the most appropriate choice was k-means clustering with a k value of 6. We will now test this decision with different clustering validation methods and decide on the cluster structure of our dataset using calculations that are more reliable than the human eye.
Although the course materials distinguish internal, external, and relative indices for this topic, we could not find such a clear distinction in the R functions and packages. We therefore thought it appropriate to apply several of these methods and metrics together.
We used the clValid library to compute the internal consistency values and stability measures. Since we use the hierarchical and k-means clustering algorithms and usually vary the number of clusters between 3 and 6, we called the clValid() function with parameters matching these values.
library(clValid)
## Loading required package: cluster
clusterMethods = c("hierarchical", "kmeans")
internal = clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems = 1000, validation = "internal")
## Warning in clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems =
## 1000, : rownames for data not specified, using 1:nrow(data)
summary(internal)
##
## Clustering Methods:
## hierarchical kmeans
##
## Cluster sizes:
## 3 4 5 6
##
## Validation Measures:
## 3 4 5 6
##
## hierarchical Connectivity 7.9155 11.8984 29.2881 49.8254
## Dunn 0.1535 0.1535 0.0673 0.0903
## Silhouette 0.3688 0.3047 0.4145 0.4097
## kmeans Connectivity 41.3099 63.8333 207.1782 211.1107
## Dunn 0.0792 0.0715 0.0478 0.0478
## Silhouette 0.4334 0.4004 0.2850 0.2875
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 7.9155 hierarchical 3
## Dunn 0.1535 hierarchical 3
## Silhouette 0.4334 kmeans 3
The Connectivity, Dunn, and Silhouette values for the combinations of clustering methods and cluster numbers are shown above. According to the Connectivity and Dunn metrics, 3-cluster hierarchical clustering gives the best result, while according to the Silhouette metric the most appropriate choice is k-means clustering with k=3.
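As a cross-check, the Silhouette value for k-means with k=3 can also be computed directly with the cluster package (loaded as a dependency of clValid); the average silhouette width should come out close to the 0.4334 reported above, up to differences in initialization.
library(cluster)
sil = silhouette(kmeansCluster3$cluster, dist(numDF))
mean(sil[, "sil_width"])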
plot(internal)
Now let’s examine the stability measures by changing the validation parameter of the clValid() function.
stab = clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems = 1000, validation = "stability")
## Warning in clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems =
## 1000, : rownames for data not specified, using 1:nrow(data)
summary(stab)
##
## Clustering Methods:
## hierarchical kmeans
##
## Cluster sizes:
## 3 4 5 6
##
## Validation Measures:
## 3 4 5 6
##
## hierarchical APN 0.0580 0.2291 0.0517 0.0368
## AD 169.3360 168.7530 119.2382 103.7467
## ADM 13.2969 50.5055 20.7552 17.9379
## FOM 36.7039 34.7147 33.3393 32.3068
## kmeans APN 0.0544 0.2306 0.1353 0.1301
## AD 97.9146 97.4735 86.6259 83.1601
## ADM 8.6290 24.4035 15.5246 13.6436
## FOM 25.9650 23.6649 23.4388 23.3290
##
## Optimal Scores:
##
## Score Method Clusters
## APN 0.0368 hierarchical 6
## AD 83.1601 kmeans 6
## ADM 8.6290 kmeans 3
## FOM 23.3290 kmeans 6
Looking at the optimal combinations according to the stability criteria, we see that k-means with 3 or 6 clusters accounts for the majority.
We summarize the result of the calculations with different clustering algorithms and cluster numbers.
vals = c("internal", "stability")
totalSummary = clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems = 1000, validation = vals)
## Warning in clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems =
## 1000, : rownames for data not specified, using 1:nrow(data)
summary(totalSummary)
##
## Clustering Methods:
## hierarchical kmeans
##
## Cluster sizes:
## 3 4 5 6
##
## Validation Measures:
## 3 4 5 6
##
## hierarchical APN 0.0580 0.2291 0.0517 0.0368
## AD 169.3360 168.7530 119.2382 103.7467
## ADM 13.2969 50.5055 20.7552 17.9379
## FOM 36.7039 34.7147 33.3393 32.3068
## Connectivity 7.9155 11.8984 29.2881 49.8254
## Dunn 0.1535 0.1535 0.0673 0.0903
## Silhouette 0.3688 0.3047 0.4145 0.4097
## kmeans APN 0.0544 0.2306 0.1353 0.1301
## AD 97.9146 97.4735 86.6259 83.1601
## ADM 8.6290 24.4035 15.5246 13.6436
## FOM 25.9650 23.6649 23.4388 23.3290
## Connectivity 41.3099 63.8333 207.1782 211.1107
## Dunn 0.0792 0.0715 0.0478 0.0478
## Silhouette 0.4334 0.4004 0.2850 0.2875
##
## Optimal Scores:
##
## Score Method Clusters
## APN 0.0368 hierarchical 6
## AD 83.1601 kmeans 6
## ADM 8.6290 kmeans 3
## FOM 23.3290 kmeans 6
## Connectivity 7.9155 hierarchical 3
## Dunn 0.1535 hierarchical 3
## Silhouette 0.4334 kmeans 3
As the summary above shows, 3-cluster hierarchical clustering and 3- and 6-cluster k-means clustering come to the fore among the most optimal combinations. After comparing the scores these combinations achieve on the metrics where they are not optimal, we decided that the best choice for clustering our dataset is k-means with k=3. The most optimal clustering is therefore as follows.
fviz_cluster(list(data=numDF, cluster=kmeansCluster3$cluster), main = "K-means Clustering when k = 3")
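As with the hierarchical clusters, the mean of the Total feature per cluster lets us check whether the final three clusters again correspond to roughly "weak", "moderate", and "strong" Pokemon:
tapply(numDF$Total, kmeansCluster3$cluster, mean)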
In the visualizations from the PCA and MDS sections, we obtained very similar graphs in terms of clusters, with one exception: the MDS performed with avg(logFC) produced a slightly different graph. Since all these visualizations use a two-dimensional plane, we used the two components with the highest variation, but outside of visualization this is not the right approach. As stated in the lecture slides, the number of components k should be chosen so that the cumulative proportion of variation exceeds 0.9. Therefore, although in this report we choose the method whose first two components carry the highest total variation and use its graph to reason about the clustering, the actual distribution may differ. Many sources state, however, that even when the variation proportions are close to each other and the cumulative proportion of the first two components does not exceed 0.9, a visualization based on the first two components comes close to the real clusters.
Although the hierarchical clustering methods did not give good results, the clusters we created with k-means clustering were satisfactory. However, since the x and y axes of the resulting graphs represent only 60.3% of the total variation, we cannot draw definite conclusions about the entire dataset.
pco.var.per[1] + pco.var.per[2]
## [1] 71.4
Facing this situation, a solution came to mind. In one of the previous assignments, Projection of Data, we learned how to reduce multidimensional data to fewer dimensions. In the Principal Coordinate Analysis we applied with the avg(logFC) distance metric, the first two dimensions captured 71.4% of the variation. We therefore want to see how k-means clustering performs on this dimensionality-reduced dataset.
kmeansPCO = kmeans(pco.data, 3, nstart=40)
fviz_cluster(list(data=pco.data, cluster=kmeansPCO$cluster), main = "K-means Clustering into 3 Clusters")
In the final paper, we will perform a non-linear projection with ISOMAP and density-based clustering with the DBSCAN algorithm for further analysis.
As a result of the projection, clustering, and validation methods we have used so far, the most appropriate combination of algorithm and cluster number for our dataset is k-means clustering with 3 clusters. Although the new clustering algorithm we will use later may produce a different clustering structure, we think the best number of clusters will still be 3, because the validity results favored 3 clusters for both the hierarchical and the k-means methods.
In addition, if the projection method we will use contains a variation of more than 71.4% in the first two dimensions, we plan to apply the most appropriate clustering method and number to the dataset with reduced dimensions thanks to this projection method.
To summarize, we think that after our final work, there will still be 3 clusters in the ‘Pokemon with stats’ dataset.
[1] StatQuest with Josh Starmer. (2017, November 7). StatQuest: PCA in R [Video]. YouTube. https://www.youtube.com/watch?v=0Jp4gsfOLMs
[2] StatQuest with Josh Starmer. (2017, December 19). StatQuest: MDS and PCoA in R [Video]. YouTube. https://www.youtube.com/watch?v=pGAUHhLYp5Q
[3] isoMDS function documentation, MASS package. https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/isoMDS.html